Compare two dataframes with different number of rows

df1=

  A   B  C  D

  a1  b1 c1 1

  a2  b2 c2 2

  a3  b3 c3 4

df2=

  A   B  C  D

  a1  b1 c1 2

  a2  b2 c2 1

I want to compare the value of the column 'D' in both dataframes. If both dataframes had same number of rows I would just do this.

newDF = df1['D']-df2['D']

However there are times when the number of rows are different. I want a result Dataframe which shows a dataframe like this.

resultDF=

  A   B  C  D_df1 D_df2  Diff

  a1  b1 c1  1     2       -1

  a2  b2 c2  2     1        1

EDIT: if 1st row in A,B,C from df1 and df2 is same then and only then compare 1st row of column D for each dataframe. Similarly, repeat for all the row.

marc_s

714k171 gold badges1315 silver badges1433 bronze badges

asked Aug 12, 2019 at 23:52

1

Use merge and df.eval

df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')

Out[314]:
    A   B   C  D_df1  D_df2  Diff
0  a1  b1  c1      1      2    -1
1  a2  b2  c2      2      1     1

answered Aug 13, 2019 at 0:13

Compare two dataframes with different number of rows

Andy L.Andy L.

24.4k4 gold badges16 silver badges27 bronze badges

0

Photo by Ilona Froehlich on Unsplash

I Will Demonstrate Several Ways And Explain The Pros and Cons Of Each One

Pandas dataframes are the workhorse of data science. While they’re not the most glamorous aspect of the field, if you asked me to pick the single most important thing for a data scientist to master, it would be the ability to skillfully build and manipulate dataframes.

In a past post, I described dataframes as part Excel spreadsheet and part SQL table but with all the versatility and analytical power of Python. Honestly, I am not yet an expert Pandas user, but I aim to be one. That’s why whenever I learn something new and useful, I will try to document it here.

Today’s subject is comparing dataframes for equality (or inequality). Often when working with data stored in dataframes, we will need to know whether they are the same. And to highlight the differences if they are not. Use cases include:

  • Quickly checking your dataframe against the master copy.
  • If you downloaded an updated version of an existing dataset, you may want to identify any new rows or updated cells.

Method 1: The .eq Method

Pandas dataframes come with the handy .eq method. It lets you quickly compare two dataframes and highlights any cells that are different. For example, let’s say we have some data on NBA players and the number of championships that they won (rings).

Now let’s say a friend is doing a research project on basketball and asked us to check his data (he is less of an NBA fan than we are). Our data is in array_1 and df_1, and the data that our friend asked us to check is in array_2 and df_2:

# Our data
array_1 = np.array([['LeBron',3],
['Kobe',5],
['Michael',6,],
['Larry',3],
['Magic',5],
['Tim',4]])
df_1 = pd.DataFrame(array_1,
columns=['Player','Rings'])
# Data from friend
array_2 = np.array([['LeBron',3],
['Kobe',3],
['Michael',6,],
['Larry',5],
['Magic',5],
['Tim',4]])
df_2 = pd.DataFrame(array_2,
columns=['Player','Rings'])

We can use the .eq method to quickly compare the dataframes. The output of .eq lists out each cell position and tells us whether the values in that cell position were equal between the two dataframes (note that rows 1 and 3 contain errors).

In:
df_1.eq(df_2)
Out:
Player Rings
0 True True
1 True False
2 True True
3 True False
4 True True
5 True True

We can use boolean indexing and the .all method to print out the rows that are erroneous. Boolean indexing is using a set of conditions to decide which rows to print (the rows where our boolean index is equal to True get printed).

In:
# .all returns True for a row if all values are True
df_1.eq(df_2).all(axis=1)Out:
0 True
1 False
2 True
3 False
4 True
5 True
# Above the rows marked True are the ones where all values equal.
# We actually want the rows marked False
In:
# Note that we specify the axis to let Pandas know that we care
# about equality across all the columns in a row
df_2[df_1.eq(df_2).all(axis=1)==False]Out:
Player Rings
1 Kobe 3
3 Larry 5

We find that our friend’s data on Kobe and Larry are wrong (Kobe is actually the one with 5 rings). Since we are looking for cells where the values do not equal, we can actually do this a bit more concisely using the .ne method (.ne stands for not equal) and .any:

In:
# .any returns true for row if any of the row's values are True
df_2[df_1.ne(df_2).any(axis=1)]Out:
Player Rings
1 Kobe 3
3 Larry 5

So one thing that jumps out is that both .eq and .ne work better when we are comparing dataframes of equal dimensions. But what if they are not? For example, let’s say our friend’s dataframe has more rows than ours (added KG and Charles).

# Data from friend
array_3 = np.array([['LeBron',3],
['Kobe',3],
['Michael',6,],
['Larry',5],
['Magic',5],
['Tim',4],
['KG',1],
['Charles',0]])
df_3 = pd.DataFrame(array_3,
columns=['Player','Rings'])

Now, let's use .ne to find the differences. The .ne method has identified differences in rows 1, 3, 6, and 7.

In:
df_1.ne(df_3)
Out:
Player Rings
0 False False
1 False True
2 False False
3 False True
4 False False
5 False False
6 True True
7 True True

If we try to use boolean indexing on df_1 to print the differences, Python gives us a warning (because our boolean index is longer than df_1) and prints only the rows for Kobe and Larry (our originally identified differences). It doesn’t print out KG and Karl because they are not in df_1. To print all of them, we need to use boolean indexing on df_3:

In:
df_3[df_1.ne(df_3).any(axis=1)]
Out:
Player Rings
1 Kobe 3
3 Larry 5
6 KG 1
7 Charles 0

What If The Indexes Differ?

So it works when the dataframes are of different lengths. But what if the indexes differ? The index is a critical part of the dataframe, it’s basically the name of a row and how we refer to the row when we need to obtain its data. When the indexes between two dataframes differ (even if the actual contents in the cells are identical), the .eq method will regard them as different entities. Let’s make a new dataframe with an alphabetical index instead of a numeric one:

In:
# Array with alphabetical index
df_4 = pd.DataFrame(array_3,
index=['a','b','c','d','e','f','g','h'],
columns=['Player','Rings'])
print(df_4)
Out:
Player Rings
a LeBron 3
b Kobe 3
c Michael 6
d Larry 5
e Magic 5
f Tim 4
g KG 1
h Charles 0

Now let’s try .eq. The output is a long and unhelpful dataframe. It has as many rows as our two dataframes, df_1 (6 rows) and df_2 (8 rows), combined. It does this because even if the cell contents are identical, the indexes are not so .eq assumes that nothing is the same across the two dataframes.

In:
df_1.eq(df_4)
Out:
Player Rings
0 False False
1 False False
2 False False
3 False False
4 False False
5 False False
a False False
b False False
c False False
d False False
e False False
f False False
g False False
h False False

Let’s see how to get around this issue.

Comparing Dataframes With Different Indexes

The easiest way is to just reset the indexes. In this case, we just need to reset df_3’s index (it will go from alphabetical back to a numeric one that starts at 0). Don’t forget to drop the index so you don’t end up with an extra column after the reset.

As you can see, thanks to resetting the index, the output is back to the wrong entries (Kobe and Larry) as well as the new ones (KG and Charles).

In:
# Reset index first and drop the original index
df_3_reset = df_3.reset_index(drop=True)
# Use boolean indexing and .ne method on reset index
df_3_reset[df_1.ne(df_3_reset).any(axis=1)]
Out:
Player Rings
1 Kobe 3
3 Larry 5
6 KG 1
7 Charles 0

Not too bad right? One problem with doing it this way though is that it requires the second dataframe’s index (post reset) to be lined up with the first’s. In other words, the overlapping players in df_1 and df_3 must be the same (and in the same order), which they are — LeBron, Kobe, Michael, Larry, Magic, Tim.

But what if they’re not? Say instead that we downloaded some new data that we want to incorporate into our dataset. Unfortunately, the new data has some redundancy with our existing data, namely LeBron, Michael, and Magic.

# New data to add to our dataset
array_new = np.array([['LeBron',3],
['Michael',6,],
['Magic',5],
['KG',1],
['Charles',0],
['Stephen',3],
['Patrick',0]])
df_new = pd.DataFrame(array_new,
columns=['Player','Rings'])

The new data looks like this. Note that Michael’s index is 2 in df_1 but 1 here, and Magic’s index is 4 in df_1 but 2 here.

    Player Rings
0 LeBron 3
1 Michael 6
2 Magic 5
3 KG 1
4 Charles 0
5 Stephen 3
6 Patrick 0

Let’s use our previous methods for comparing dataframes, .eq and .ne. First let’s use .eq (and .all) to see what’s the same between the new data and our existing data:

In:
df_new[df_1.eq(df_new).all(axis=1)]
Out:
Player Rings
0 LeBron 3

It says that only LeBron’s entry is the same even though Michael and Magic’s data are the same too. The issue, as we already know, is the different index. If we use .ne (and .any) to identify the new or different rows, we get a long list including ones that we don’t want to be in there — we don’t want to mistakenly insert redundant Michael and Magic entries into our data.

In:
df_new[df_1.ne(df_new).any(axis=1)]
Out:
Player Rings
1 Michael 6
2 Magic 5
3 KG 1
4 Charles 0
5 Stephen 3
6 Patrick 0

Using Merge

In this case, the best way (that I know of) is to use the .merge method (I wrote a post about merge here, so read this if you need background on .merge). It’s a bit more roundabout but by using merge, we can compare just the values of each entry, irrespective of the indexes.

Merging using the Player and Rings columns allows us to match up rows with identical values between our existing and new dataframes (while ignoring differences in the dataframe’s indexes). We need to rename the Rings column in our new dataframe to get it to output as two separate columns (Rings for the old and Rings_new for new). You will see why once we examine the output.

# Need to rename Rings since we are merging on it but we want
# it to show as different columns post-merge
temp = df_new.rename({'Rings': 'Rings_new'}, axis=1)
merged = temp.merge(df_1, how='left',
left_on=['Player','Rings_new'],
right_on=['Player','Rings'])

I’ve printed the contents of the dataframe called merged below. The NaNs are what we are looking for (and why we needed both columns to show). Because we did a left join, the output includes every row in df_new — the players that we already have data for in df_1 have numeric values in the Rings column. The players that are new are denoted by a NaN value in the Rings column — these are the only ones that we want to add to our dataset.

    Player Rings_new Rings
0 LeBron 3 3
1 Michael 6 6
2 Magic 5 5
3 KG 1 NaN
4 Charles 0 NaN
5 Stephen 3 NaN
6 Patrick 0 NaN

We can isolate out the new entries by slicing the dataframe for just the rows with a NaN value in Rings:

In:
df_new[merged['Rings'].isna()]
Out:
Player Rings
3 KG 1
4 Charles 0
5 Stephen 3
6 Patrick 0

And concatenate to our dataframe like so:

final_df = pd.concat([df_1,
df_new[merged['Rings'].isna()]],
axis=0)

And finally, we get what we want. The contents of final_df looks like:

    Player Rings
0 LeBron 3
1 Kobe 5
2 Michael 6
3 Larry 3
4 Magic 5
5 Tim 4
3 KG 1
4 Charles 0
5 Stephen 3
6 Patrick 0

Note that if there was an incorrect entry for an existing player like [LeBron, 0] in the new data, it would also have been inserted because we merge on both Player and Rings (so for the entries to match, both the names and ring counts need to be identical). If we don’t want this behavior, then we could merge solely on the Player column. This could cause issues too as there might be another player with the same name but a different ring count (like Michael Cooper with his 5 rings versus Michael Jordan with his 6). Of course with a real dataset, we would include enough fields so that we could uniquely identify every player (e.g. first name, last name, birth date).

Hope this was insightful and good luck with your dataframes! Cheers!

How do you compare rows of two DataFrames?

The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.

How do you find the different rows between two data frames in Python?

You can use the DataFrame. diff() function to find the difference between two rows in a pandas DataFrame.

How do you compare two DataFrames and find the difference?

By using equals() function we can directly check if df1 is equal to df2. This function is used to determine if two dataframe objects in consideration are equal or not. Unlike dataframe. eq() method, the result of the operation is a scalar boolean value indicating if the dataframe objects are equal or not.

How do I combine two DataFrames with different rows and columns?

Let's merge the two data frames with different columns. It is possible to join the different columns is using concat() method..
DataFrame: It is dataframe name..
Mapping: It refers to map the index and dataframe columns..
axis: 0 refers to the row axis and1 refers the column axis..
join: Type of join..