{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## MIDTERM - W200 Introduction to Data Science Programming, UC Berkeley MIDS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instructions\n", "The midterm exam is designed to evaluate your grasp of Python theory as well as Python coding.\n", "\n", "- This is an individual open book exam.\n", "- You have 48 hours to complete the exam and upload it back to ISVC, starting from the point when you first accessed it on ISVC.\n", "- You may use any libraries from the Python Standard Library for this exam: https://docs.python.org/3/library/\n", "\n", "For the coding questions, follow best practices. Partial credit may be given for submissions that are clearly commented and well organized." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Short Questions (15 pts)\n", "\n", "1.1. Python's dynamic typing allows one variable to refer to multiple types of objects within the same program execution. This can speed program development. Name one disadvantage of dynamic typing." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Though dynamic typing increases speed of program's development, it slows down the execution speed of the program as the interpreter ha to check the type of value every variable contains. Dynamic typing can also cause runtime errors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.2. Compiled languages typically offer faster performance than interpreted languages. List two reasons why would you choose an interpreted language like Python for the purpose of analyzing a data set instead of a compiled language." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "They are - \n", "1) Ease of debugging\n", "2) small program size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.3. We have gone over FOR and WHILE loops. Discuss one reason to use a for loop over a while loop and one reason to use a while loop over a for loop. Please elaborate beyond a single word." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Reason to use for loop- For loop is used when the exact number of iterations are know as the initialization, condition and update statement all are contained in the for structure.\n", "\n", "Reason to use while loop- While loop is used when the number of iterations are not known as it only contains the condition statement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Programming Styles (10 pts)\n", "\n", "We have taught you a number of ways to program in Python. These have included using scripts versus functions and using Jupyter notebooks versus calling .py files from the command line. Describe a scenario in which you would use .py files from the command line over a Jupyter notebook and the opposite. Describe a scenario in which you would use a script versus a function and the opposite. There are four cases and each answer should be about 1-2 sentences, explain why!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.1. I would use a Jupyter notebook:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Jupyter Notebook provides you with an easy-to-use, interactive data science environment that doesn’t only work as an integrated development environment (IDE), but also as a presentation or educational tool." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.2. I would use .py files over a Jupyter notebook:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ".py files are organised, easier to debug and ideal for production while ipynb files are unorganised, difficult to debug and are not production ready." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.3. I would use a script over a function:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scripts are helpful for tasks that does not change. It is like a keyboard macro, when the name is typed the entire script commands are exectued as if they were written there while in a function, you use the call stack and jump the execution pointer to the function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.4. I would use a function over a script:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions are great to use when there are changes in code, for example when a set of commands can have different inputs and outputs while scripts can only be used with the same commands and variables as written." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Dictionaries vs Lists (10 pts)\n", "\n", "Suppose we have the following problem. We have a dictionary of names and ages and we would like to find the oldest person.\n", "\n", "```\n", "ages_dict = {'Bill':34, 'Fred':45, 'Alice':14, 'Betty':17}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionary approach\n", "Here is a loop that shows how to iterate through the dictionary:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fred is the oldest.\n" ] } ], "source": [ "ages_dict = {'Bill':34, 'Fred':45, 'Alice':14, 'Betty':17}\n", "\n", "max_age = 0\n", "max_name = ''\n", "for name,age in ages_dict.items():\n", " if age > max_age:\n", " max_age = age\n", " max_name = name\n", " \n", "print(max_name, \"is the oldest.\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List approach \n", "\n", "Your friend comes to you and thinks that this dictionary is difficult to deal with and instead offers a different plan using two lists.\n", "\n", "```\n", "names = ['Bill', 'Fred', 'Alice', 'Betty']\n", "ages = [34, 45, 14, 17]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of using a loop, your friend writes this code to find the oldest person." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fred is the oldest.\n" ] } ], "source": [ "names = ['Bill', 'Fred', 'Alice', 'Betty']\n", "ages = [34, 45, 14, 17]\n", "\n", "max_age = max(ages)\n", "index_max = ages.index(max_age)\n", "max_name = names[index_max]\n", "\n", "print(max_name, 'is the oldest.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion\n", "Discuss the advantages and disadvantages of each of the approaches. \n", "\n", "3.1. Is one more efficient (i.e. faster in Big O notation) than the other?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time taken by both approaches are same in this case as you'll have to go through n values(whole data structure) O(n). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.2. Why would you prefer the dictionary?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would prefer the dictionary as it keeps data in a single data structure and you can directly look up the age of any person with his/her name without iterating throughout the data structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.3. Why would you prefer the two lists?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would prefer two list when our searcing factor can be both the name as well as age." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Mutability Surprises (15 pts)\n", "\n", "4.1. In the asynchronous sessions, we discussed mutability. Please describe, in your own words, why mutability is a useful feature in Python lists and dictionaries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mutability is the ability to allow one person change data inside the structure. Since lists and dictionaries are mutable, they can be used for multiple purposes where data keeps on changing and needs to be updated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mutability can also lead to unexpected behavior - specifically when multiple variables point to the same object or when mutable objects are nested inside each other. \n", "\n", "4.2. Please write some code demonstrating a situation where mutability could lead to unexpected behavior. Specifically, show how updating just one object (list_a) can change the value when you print a second variable (list_b)." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5, 4]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_a = [5]\n", "list_b = list_a\n", "list_a.append(4)\n", "list_b\n", "#As you can see updating a changes the value shown by b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.3. Show how \"copy\" or \"deepcopy\" could be used to prevent the unexpected problem you described, above." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_a = [5]\n", "list_b = list_a.copy()\n", "list_a.append(4)\n", "list_b\n", "#As you can see updating a does not change the value shown by b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.4. Now, show the same problem using two dictionaries. Again show how \"copy\" or \"deepcopy\" can fix the issue." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n", "5\n", "3\n", "3\n" ] } ], "source": [ "#Problem\n", "dict_a = {1:3, 2:4}\n", "dict_b = dict_a\n", "print(dict_b[1])\n", "dict_a[1] = 5\n", "print(dict_b[1])\n", "\n", "#fix\n", "dict_a = {1:3, 2:4}\n", "dict_b = dict_a.copy()\n", "print(dict_b[1])\n", "dict_a[1] = 5\n", "print(dict_b[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.5. Can this unexpected behavior problem occur with tuples? Why, or why not?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This problem cannot occur with tuples as they are unmutable, once made they cannot be changed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Tweet Analysis (15 pts)\n", "\n", "A tweet is a string that is between 1 and 280 characters long (inclusive). A username is a string of letters and/or digits that is between 1 and 14 characters long (inclusive). A username is mentioned in a tweet by including @username in the tweet (notice the username does not include the `@` symbol). A retweet is way to share another user's tweet, and can be identified by the string RT, followed by the original username who tweeted it.\n", "\n", "Your job is to fill in the function *count_retweets_by_username* so that it **returns** a frequency dictionary that indicates how many times each retweeted username was retweeted." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "tweets = [\"This is great! RT @fakeuser: Can you believe this?\",\n", " \"It's the refs! RT @dubsfan: Boo the refs and stuff wargarbal\",\n", " \"That's right RT @ladodgers: The dodgers are destined to win the west!\",\n", " \"RT @sportball: That sporting event was not cool\",\n", " \"This is just a tweet about things @person, how could you\",\n", " \"RT @ladodgers: The season is looking great!\",\n", " \"RT @dubsfan: I can't believe it!\",\n", " \"I can't believe it either! RT @dubsfan: I can't believe it\"]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def count_retweets_by_username(tweet_list):\n", " \"\"\" (list of tweets) -> dict of {username: int}\n", " Returns a dictionary in which each key is a username that was \n", " retweeted in tweet_list and each value is the total number of times this \n", " username was retweeted.\n", " \"\"\"\n", " \n", " # write code here and update return statement with your dictionary\n", " dictionary = dict()\n", " for tweet in tweet_list:\n", " if (tweet.find('RT @') != -1):\n", " start = tweet.find('RT @')\n", " end = tweet.find(':')\n", " if tweet[start:end] in dictionary:\n", " dictionary[tweet[start+4:end]] += 1\n", " else:\n", " dictionary[tweet[start+4:end]] = 1\n", " \n", " return dictionary" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'fakeuser': 1, 'dubsfan': 1, 'ladodgers': 1, 'sportball': 1}\n" ] } ], "source": [ "# allow this code to work by implementing count_retweets_by_username function above\n", "print(count_retweets_by_username(tweets))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Looking for Minerals (20 pts)\n", "\n", "A mining company conducts a survey of an n-by-n square grid of land. Each row of land is numbered from 0 to n-1 where 0 is the top and n-1 is the bottom, and each column is also numbered from 0 to n-1 where 0 is the left and n-1 is the right. The company wishes to record which squares of this grid contain mineral deposits.\n", "\n", "The company decides to use a list of tuples to store the location of each deposit. The first item in each tuple is the row of the deposit. The second item is the column. The third item is a non-negative number representing the size of the deposit, in tons. For example, the following code defines a sample representation of a set of deposits in an 8-by-8 grid." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "deposits = [(0, 4, .3), (6, 2, 3), (3, 7, 2.2), (5, 5, .5), (3, 5, .8), (7, 7, .3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6.1. Given a list of deposits like the one above, write a function to create a string representation for a rectangular sub-region of the land. Your function should take a list of deposits, then a set of parameters denoting the top, bottom, left, and right edges of the sub-grid. It should **return** (do not print in the function) a multi-line string in which grid squares without deposits are represented by \"-\" and grid squares with a deposit are represented by \"X\"." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----X---\n", "--------\n", "--------\n", "-----X-X\n", "--------\n", "-----X--\n", "--X-----\n", "-------X\n", "\n" ] } ], "source": [ "def display(deposits, top, bottom, left, right):\n", " \"\"\"display a subgrid of the land, with rows starting at top and up to \n", " but not including bottom, and columns starting at left and up to but\n", " not including right.\"\"\"\n", " ans = \"\"\n", " arr = [[0 for i in range(bottom - top)] for i in range(bottom - top)]\n", " for deposit in deposits:\n", " arr[deposit[0]][deposit[1]] = deposit[2]\n", " for i in range(bottom - top):\n", " for j in range(bottom - top):\n", " if arr[i][j] == 0:\n", " ans += \"-\"\n", " else:\n", " ans += \"X\"\n", " ans += \"\\n\"\n", " return ans\n", "print(display(deposits, 0, 8, 0, 8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, your function should replicate the following behavior for the example grid:\n", "```\n", "print(display(deposits, 0, 8, 0, 8))\n", "----X---\n", "--------\n", "--------\n", "-----X-X\n", "--------\n", "-----X--\n", "--X-----\n", "-------X\n", "\n", "print(display(deposits, 5, 8, 5, 8))\n", "X--\n", "---\n", "--X\n", "\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6.2. Next, complete the following function to compute the total number of tons in a rectangular sub-region of the grid." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.1\n" ] } ], "source": [ "def tons_inside(deposits, top, bottom, left, right):\n", " \"\"\"Returns the total number of tons of deposits for which the row is at least top,\n", " but strictly less than bottom, and the column is at least left, but strictly\n", " less than right.\"\"\"\n", " # Do not alter the function header. \n", " # Just fill in the code so it returns the correct number of tons.\n", " sum = 0\n", " for deposit in deposits:\n", " if top <= deposit[0] < bottom and left <= deposit[1] < right:\n", " sum += deposit[2]\n", " return sum\n", "print(tons_inside(deposits, 0, 8, 0, 8))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## 7. Birthday planning (15 pts)\n", "\n", "Suppose you record a list of birthdays for your classmates, recorded as month day tuples. An example is given below." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# The 2nd to last tuple needs the int(2) in it so that it is uniquely stored in memory compared to (2,8)\n", "# Under the hood Python 3.7 changed how these are stored so (2,8) and (2,8) are stored in the same location\n", "# and then the algorithm below doesn't work\n", "\n", "dates = [(3,14),(2,8),(10,25),(5,17),(3,2),(7,25),(4,30),(8,7),(int(2),8),(1,22)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You read about the famous birthday problem and you become interested in the number of pairs of classmates that share the same birthday. Below is an algorithm you write to do this. (Note: the ```is``` operator tests that two operands point to the same object)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total birthday pairs: 1\n" ] } ], "source": [ "count = 0\n", "\n", "for person_a in dates:\n", " for person_b in dates:\n", " # Make sure we have different people \n", " \n", " if person_a is person_b:\n", " continue\n", " \n", " # Check both month and day\n", " if person_a[0] == person_b[0] and person_a[1] == person_b[1]:\n", " count += 1\n", " \n", "# We counted each pair twice (e.g. jane-bob and bob-jane) so divide by 2:\n", "print(\"Total birthday pairs:\", count//2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.1. What is the (tightest) Big-O running time bound for the above algorithm? You may assume that simple operations like equality check, addition, and print take constant time." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "O(n^2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.2. You notice that your algorithm is inefficient in that it counts each pair twice. For example, it will increment count once when person_a is Jane and person_b is Bob, and again when person_a is Bob and person_b is Jane. Below, revise the algorithm so that it only looks at each pair once." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total birthday pairs: 1\n" ] } ], "source": [ "count = 0\n", "\n", "for i in range(len(dates)):\n", " for j in range(i, len(dates)):\n", " # Make sure we have different people \n", " person_a = dates[i]\n", " person_b = dates[j]\n", " if person_a is person_b:\n", " continue\n", " \n", " # Check both month and day\n", " if person_a[0] == person_b[0] and person_a[1] == person_b[1]:\n", " count += 1\n", "print(\"Total birthday pairs:\", count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.3. What is the (tightest) Big-O running time bound for your new algorithm? What does this tell you about whether your revision was worth making?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is still O(n^2) since the iterations are reduced by a factor of 2 but the program still computes like (n * n -1 * n - 2 ...)\n", "Though the Big-O Notation doesn't decrease it tells that the decision wasn't worth making (but it is still worth making since in real life decreasing complexity by a factor of 2 is a huge deal)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.4. Finally, create a third revision of your algorithm which has a faster Big-O running time bound that both the previous algorithms." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total birthday pairs: 1\n" ] } ], "source": [ "count = 0\n", "date = {}\n", "for i in range(len(dates)):\n", " key = \"{}/{}\".format(dates[i][0], dates[i][1])\n", " if key in date:\n", " count += 1\n", " else:\n", " date[key] = 1\n", " \n", "print(\"Total birthday pairs:\", count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.5. What is the (tightest) Big-O running time bound for your last algorithm? Explain what trade-off you made to have a faster running time." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The runtime is O(n log n). We use a dictionary (which takes O(n) space which the variables didn't take) to reduce search time as look up in an array is O(n) while in a dictionary is O(log n)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 1 }