My Blog

Update on the website

So finally, I am done with the website. I ended up restructuring the entire site. I had prepared a checklist of the things I wanted to include on the website and completed 9 out of the 11 items. One pending item was to pull my latest blog posts (Tumblr and WordPress) programmatically; it wasn't straightforward, so for now I have just hard-coded some blog links there. The other was a basic "contact me" form with a simple email text box, a message body and a send button. These are the two items pending as of now.

So last time I shared how I included my top Stack Overflow answers using the stackr package. I also managed to get my top Quora answers by scraping my profile page on Quora. It was a simple rvest query in R, which I figured out by asking this question on Stack Overflow (a rough sketch follows below). On the front page I just kept my photo, a link to recent SO answers, a link to recent Quora answers and the Twitter widget. I added a slider input which included some more tab options. The front tab was called "About Me".
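
For reference, the scraping boils down to something like the following minimal rvest sketch; the profile URL and the CSS selector are placeholders here, since the exact selector came from that Stack Overflow answer and Quora's markup changes over time:

library(rvest)

# Placeholder profile URL and selector -- adjust both for your own profile
# and for whatever markup Quora currently uses for answer links.
profile_url <- "https://www.quora.com/profile/your-profile/answers"
page <- read_html(profile_url)

answer_links <- page %>%
  html_nodes("a.answer_permalink") %>%   # hypothetical CSS selector
  html_attr("href")

head(answer_links)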

I named the second tab "Side Projects", with a brief description of all the projects I have worked on: bsetools, bsedata, the Twitter bots and Project Euler, along with the relevant GitHub and information links. The next tab is the resume, where I have included my short resume, and finally the "Contact Me" section. I had planned to put the contact form there, but for now it has some technical blogs from WordPress, some personal blogs from Tumblr and links to the various other websites where I am present. A rough skeleton of this layout is sketched below.
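
Stripped of the real content and styling, the tab layout boils down to something like this minimal Shiny skeleton (a rough sketch, not the deployed app):

library(shiny)

# Bare-bones single-page layout with one tab per section described above
ui <- navbarPage(
  "Ronak Shah",
  tabPanel("About Me",
           p("Photo, recent Stack Overflow and Quora answers, Twitter widget")),
  tabPanel("Side Projects",
           p("bsetools, bsedata, twitter bots, Project Euler")),
  tabPanel("Resume",
           p("Employment and education history")),
  tabPanel("Contact Me",
           p("WordPress and Tumblr blog links, other profiles"))
)

server <- function(input, output, session) {}

shinyApp(ui, server)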

This is the current version of the website and I am not planning to add anything more soon. Just for reference, the website is at https://shahronak.shinyapps.io/my_shiny_app/


What data does LinkedIn collect – Part 2

So I got the complete data today and it was a superset of the one sent in Part 1; all the data from Part 1 was repeated here as well. This is what the folder structure looked like.

[screenshot of the folder structure]

So let's look at the files, ignoring the ones we have covered previously.

  • Shares.csv contains all the posts you have shared, their unique URLs, the shared URL etc.


Sadly, I have shared only one post on LinkedIn so far.

  • Security Challenges.csv wasn't completely clear to me, but I believe these are the times when you logged in from a new place or computer for the first time and were asked a security challenge to verify your identity.


  • Search Queries.csv has records of all the search terms you entered on LinkedIn, be it about a company or a person, along with their time.


  • Logins.csv – Has all the instances of my logins, but surprisingly it doesn't have any recent entries and also has very few rows. That is surprising because I log in to LinkedIn every day and there seems to be no record of it in this sheet. Maybe it only records the times when I really "log in" to the website, i.e. when I enter my username and password. I usually have LinkedIn already logged in on my personal and work computers.


  • Likes.csv – Contains all the posts you have liked, the time when you liked them and the content of the posts.


  • Group Comments.csv – If you are a member of any group and have posted or commented anything in it, that gets recorded here. It has columns like the time of the post, your comment, the subject of the post, the original text you commented on and its URL.
  • Endorsement Received Info.csv has the details of all the endorsements you have received, who endorsed you and when.


  • Comments.csv includes all the posts you have commented on. Note that this is different from Group Comments.csv, which only covers comments in a group, but it contains the same kind of data: the original post, your comment on it, the time of the comment and its URL.

  • Ad Targeting.csv is again all my information. I was quite interested to see this csv as I was expecting a huge amount of data here based on my behaviour on the site, but unfortunately it had only one row.


Just FYI, whatever algorithm they used to determine my age, they have got it wrong. 😛

  • Account Status History.csv had the information about when the account was created.


Looking at all the data, LinkedIn doesn't seem to be storing anything extra or anything from outside of LinkedIn itself. Everything stored is obvious and we already know it.

What data does LinkedIn collect? – Part I

I am more of a passive user on LinkedIn; I hardly like, share or post any status there. However, just out of curiosity, I downloaded the data which LinkedIn has stored for me and decided to analyze it. Getting your data is quite easy: go to your LinkedIn account, open Settings, select Privacy and go to "Download your data". It will ask you to enter your LinkedIn password for security reasons.


After that, it says they will notify you once the data is ready to download. Within 10-15 minutes I received an email from LinkedIn saying that Part 1 of my data was ready for download, and that the second part would be ready within 24 hours. The email redirects you to your LinkedIn profile again, from where you can download the zip file.

After unzipping the file, this is how the folder structure looks:

[screenshot of the folder structure]

1) Videos.csv contained just one line, https://www.linkedin.com/psettings/member-data/videos, and when I went to that link it said "You haven't uploaded any videos", which is true. I haven't uploaded any videos on LinkedIn.

2) Skills.csv contained all the skills I had mentioned on my profile. Some of them were R, Data Analysis, Statistics, Data Science etc., the skills other people endorse us for (even if we don't have them 😛 ). I actually expected them to also store the count of how many people have endorsed each skill, because that is an important number to keep. Anyway, if they are showing it on the profile they must definitely be storing it somewhere.

3) Registration.csv had the details of when I registered for the website, I suppose. I don't clearly recall the date and time I signed up for LinkedIn, but I am assuming this is correct. The other columns were blank.


From the IP address, I checked what details I could find out. A basic search reveals these details:

[screenshot: IP address lookup result]

So, I was at home when I signed up for LinkedIn.

4) Projects.csv includes the projects you have added on LinkedIn along with their descriptions, URLs (if any) and start and end dates.


5) Profile.csv maintains the details of your profile that you have shared: your name, address, birth date etc.


6) Positions.csv keeps a record of all your employment details. The organizations you have worked with, your title there, the duration for which you worked etc.


7) Phone Numbers.csv contained only my phone number in it.

8) Messages.csv had all the conversations/messages I have had over LinkedIn, all the messages sent and received. One thing worth noticing was that it had 1074 messages in total and the oldest message in the file was from November 2014. It is hard to believe that I did not send or receive any messages between signing up in 2011 and 2014. Or do they only export the most recent 1000 or so messages?


9) Languages.csv contains languages and their proficiency.


10) Invitations.csv contains information about all the invitations sent and received by you: the time each invitation was sent and any invitation message sent along with it. This too had around 2k rows, which is way fewer than my total connections, and the oldest entry in it was from 2017, so I believe this also has some cap like Messages.csv.


11) Imported Contacts.csv has all the contacts you have imported from your personal email address: the first and last name of each imported contact, their email address, when the contact was imported and their phone number (if any).

12) Email Addresses.csv includes your email addresses. I had two, one primary and one secondary. It also has a flag indicating whether each email address is verified or not.

13) Education.csv, like Positions.csv, has the details of whatever education history you have uploaded.


14) Courses.csv includes all the courses you have taken and added on the platform.

15) Connections.csv – Now this is, I think, the most important csv of all. It has the list of all your connections along with their email addresses, the company they work for, their position and the time when you connected with them. One thing to keep in mind is that when we connect with anybody, we are giving away our email address to them.

16) Certifications.csv has the certificates which you have included.

17) Cause you Care About.csv – This too is straightforward info which you have shared.

Media Files – has the media files you have shared on the platform: any images or documents that were uploaded.

Jobs – This is divided into two parts. One csv is for your job preferences (Job Seeker Preference.csv), which I don't even remember setting up; it says I am looking for a job casually and ready to join in 4 to 6 months. The other csv, Job Applications.csv, has the details of all the jobs you have applied to till now: the time when you applied, the title you applied for, the company name and the name of the resume you uploaded.


What am I working on recently?

I know I haven't written a blog post in a long time. I haven't solved any Euler problem since last time, and the thing I am working on is a long, never-ending process. I usually write a blog post once I finish something, but as this is a continuous process I didn't bother writing about it. However, I realized I should stop and just give an update on what's going on and what I am working on.

I am building my own web page. I know sharonak47.wordpress.com is already a web page and I could do many things with WordPress, but it has been a long time since I edited or modified anything on WordPress, so if I had to continue using it I would have to learn WordPress again and reinvent the wheel. I don't mind doing that, but I thought learning and using WordPress would not be that effective for me in the long run; I don't want to be a WordPress developer in the near future. How would it be if I could do this in my favorite language? Yes, you guessed it right: I am building a website in R 😀 ;-). More or less it will just be a single-page website. Shiny has become quite powerful and has grown a lot in the last couple of years, and there are many amazing things one can do with it, although the page I am building is pretty basic and won't use all of Shiny's powers. The idea behind this website is to be a central place for all of my web presence, so anybody who reaches my website will be able to find me anywhere (only online though :P).

You can find the website at https://shahronak.shinyapps.io/my_shiny_app/ (I know I need a better URL, but this is what shinyapps provides 😛 ). It is still in the development phase and you will see some changes in the next few weeks. I am now deciding what to include on the website. I was pretty sure that I wanted to include my Stack Overflow answers there (yes, I am pretty proud of them). There are quite a few packages in R for playing with the Stack Overflow API, but most of them are half-baked and old. I found the stackr package from David Robinson, which worked quite well. I provide my user ID (3962914) and get the recent answer IDs. Using those answer IDs I generate the URLs for my answers by appending them to https://stackoverflow.com/a/. For display purposes I cannot show the entire answer body, so I decided to display the question title and anchor it with the answer URL. We get the question IDs from the same stack_users call and the titles of those questions from the stack_questions function, and we display only the 6 most recent answers (a rough sketch follows below). That integrated pretty well: it updates as soon as I post an answer on Stack Overflow and nothing needs to be checked manually.
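
In code, the flow is roughly the following. This is only a rough sketch of the stackr calls from memory (stackr lives on GitHub rather than CRAN), so the exact arguments and column names may need tweaking:

# devtools::install_github("dgrtwo/stackr")   # stackr is a GitHub-only package
library(stackr)

# Recent answers for my user ID; argument and column names below are assumptions
answers <- stack_users(3962914, "answers", num_pages = 1, pagesize = 6)

# Answer URLs are just the base URL plus the answer ID
answer_urls <- paste0("https://stackoverflow.com/a/", answers$answer_id)

# Titles of the corresponding questions (match them back by question_id in practice)
questions <- stack_questions(answers$question_id)

# What the page actually renders: question titles anchored to the answer URLs
links <- paste0('<a href="', answer_urls, '">', questions$title, '</a>')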

I was also planning to display my top answers from Quora similarly. Unfortunately, there are no R packages to do the same for Quora. Moreover, Quora does not even have official API support, which makes it more difficult. There are a few packages in Python which allow querying Quora, but I could find nothing in R, so this is postponed for now as there is no straightforward approach. I added a Twitter widget which shows my recent tweets in the sidebar; it is the same one I used on the WordPress site, so it was pretty simple (a minimal sketch of the embed follows below). Apart from all these, I have added links to my profiles on various other platforms like Facebook, LinkedIn, GitHub, Quora and my WordPress blog. I also wanted to show the titles of my recent WordPress blog posts, but even that is not straightforward. Further, I also plan to include my side projects somewhere, like the couple of Twitter bots I have developed and the bsetools and bsedata projects.
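
For the record, the Twitter widget is just Twitter's standard embedded-timeline snippet dropped into the Shiny UI, roughly like this sketch (the handle is a placeholder):

library(shiny)

# Standard embedded-timeline snippet wrapped for a Shiny UI.
# "your_handle" is a placeholder -- substitute the real Twitter handle.
twitter_widget <- tags$div(
  tags$a(class = "twitter-timeline",
         href  = "https://twitter.com/your_handle",
         "Tweets"),
  tags$script(async = NA,
              src = "https://platform.twitter.com/widgets.js",
              charset = "utf-8")
)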

I gave it a resume kind of look by including my employment and educational history. Working on the client side was mostly working with HTML and CSS; I have some basic skills in that, which I am using here. Let's see how far I can take this.

Euler Problem 58 – Spiral Primes

Spiral primes

Problem 58

Starting with 1 and spiralling anticlockwise in the following way, a square spiral with side length 7 is formed.

37 36 35 34 33 32 31
38 17 16 15 14 13 30
39 18  5  4  3 12 29
40 19  6  1  2 11 28
41 20  7  8  9 10 27
42 21 22 23 24 25 26
43 44 45 46 47 48 49

It is interesting to note that the odd squares lie along the bottom right diagonal, but what is more interesting is that 8 out of the 13 numbers lying along both diagonals are prime; that is, a ratio of 8/13 ≈ 62%.

If one complete new layer is wrapped around the spiral above, a square spiral with side length 9 will be formed. If this process is continued, what is the side length of the square spiral for which the ratio of primes along both diagonals first falls below 10%?

At first I didn't get how the spiral was formed; I was confused about how the numbers were generated. After reading it a second time, I remembered we had dealt with something similar previously; the blog post for that is here and the code can be found here. With the help of that code, the main work of getting the spiral matrix right was done, generating the sequence of four diagonal numbers we are interested in at each layer. The remaining work was to find out how many of those 4 numbers are actually prime and calculate the ratio of the total number of primes on the diagonals to the total number of elements on the diagonals.

So we have two variables: previous_prime_count, which keeps track of the number of primes on the diagonals, and previous_length, which keeps track of the total number of elements on the diagonals. At each iteration we add 4 elements to the diagonals regardless. The is_prime function checks how many of those four elements are prime and adds that count to previous_prime_count. We keep generating new layers using the formula until the ratio of previous_prime_count to previous_length goes below 0.1.

This was a nice and simple program; however, I immediately started running into issues as soon as I ran it. The initial 200 iterations were covered easily, but it became extremely slow when the ratio was around 0.2. I thought it was almost done and that going from 0.2 to 0.1 would hardly take any time, but I was so wrong: it took 3 days. 3 long days... yeah, I know it is crazy, but it really did take that much time. Initially I didn't even see where the problem was, since each step of the loop does a constant amount of work, so I was not sure where all the time was going. I realized later that as the numbers grew, more and more time was being spent in the is_prime function. It turns out I had not written the most efficient is_prime: ideally the loop should go from 2 to sqrt(n) to check whether a number is prime, but I had been checking up to n/2, which is a very big performance loss.
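
For completeness, a minimal vectorized sketch of the sqrt(n)-bounded check looks like the one below; this is not the actual Check_if_Prime.R that the script sources, just the idea that would have saved those three days:

# Sketch of a faster primality check: trial division only up to sqrt(n)
is_prime <- function(x) {
  sapply(x, function(n) {
    if (n < 2) return(FALSE)
    if (n %in% c(2, 3)) return(TRUE)
    if (n %% 2 == 0) return(FALSE)
    if (n < 9) return(TRUE)                      # 5 and 7
    divisors <- seq(3, floor(sqrt(n)), by = 2)   # odd candidates up to sqrt(n)
    all(n %% divisors != 0)
  })
}

is_prime(c(3, 5, 7, 9))
#[1]  TRUE  TRUE  TRUE FALSE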

source("/Users/Ronak Shah/Google Drive/Git-Project-Euler/1-10/3.Check_if_Prime.R")

flag = TRUE
i = 0
previous_prime_count = 0
previous_length = 1
while(flag) {
 i = i + 1
 constant = 2 * i
 length_i = constant + 1
 second_number = 4 * (i^2) + 1
 first_number = second_number - constant
 third_number = second_number + constant
 fourth_number = third_number + constant
 previous_prime_count = previous_prime_count + sum(is_prime(c(first_number, 
 second_number, third_number, fourth_number)))
 previous_length = previous_length + 4
 ratio = previous_prime_count / previous_length
 cat(ratio, length_i, "\n")
 if (ratio < 0.1)
  flag = FALSE
}

previous_length
#[1] 26241

 

Euler Problem 57 – Square Root Convergents

Square root convergents

Problem 57

It is possible to show that the square root of two can be expressed as an infinite continued fraction.

√ 2 = 1 + 1/(2 + 1/(2 + 1/(2 + … ))) = 1.414213…

By expanding this for the first four iterations, we get:

1 + 1/2 = 3/2 = 1.5
1 + 1/(2 + 1/2) = 7/5 = 1.4
1 + 1/(2 + 1/(2 + 1/2)) = 17/12 = 1.41666…
1 + 1/(2 + 1/(2 + 1/(2 + 1/2))) = 41/29 = 1.41379…

The next three expansions are 99/70, 239/169, and 577/408, but the eighth expansion, 1393/985, is the first example where the number of digits in the numerator exceeds the number of digits in the denominator.

In the first one-thousand expansions, how many fractions contain a numerator with more digits than denominator?

At first, this looked like a complicated question and I was scared of the large decimal numbers. After thinking for a while, I noticed that at every step we are adding 1/2 to the previous expansion, which is constant, so the sequence that is generated should be deterministic. I kept the numerator and denominator separately; there is no need to keep them as a fraction and complicate things. Concentrating on the numerator first, the sequence is 3, 7, 17, 41, 99, 239, ... I tried to derive a formula which generates this sequence. I was unsuccessful, but then I checked the OEIS website to understand the sequence, where I found A001333, which gives the formula for generating the numerator sequence:

A(n) = 2 * A(n-1) + A(n-2)

Upon testing, this clearly matched the numbers shown. When I checked the sequence of denominators, it was a different sequence but followed the same formula. It was surprising that both numerators and denominators are generated using the same recurrence. As everything was deterministic, I thought even the indexes at which the number of digits in the numerator exceeds that of the denominator should be deterministic as well. I printed out all such indexes:

8, 13, 21, 26, 34, 39, 47, 55, 60, 68, 73

were the indexes. There is a page for this sequence as well; unfortunately, however, there is no formula to generate it. If there were a formula, we could have generated the sequence up to 1000, counted the number of elements in it, and that would have been the answer.

However, as no such formula is available, we use the traditional approach and count the expansions where the numerator has more digits than the denominator. We need two initial numbers to generate the next one, which we can get from the example itself: for the numerator we use 3 and 7, and for the denominator we use 2 and 5. We generate the next number using the formula:
For the numerator:

A(n) = 2 * A(n-1) + A(n-2)
A(n) = 2 * 7 + 3
A(n) = 17

And for the denominator:

A(n) = 2 * 5 + 2
A(n) = 12

And we keep repeating this procedure for 1000 iterations. Towards the end it started giving Inf as the answer, which meant that R's native numbers could not handle such large values. To overcome that we use the gmp library, which has a function called mul.bigz to multiply arbitrarily large numbers. The rest is simple R code.

library(gmp)

# Seed values from the example: expansions 1 and 2 are 3/2 and 7/5
num_1 = 3
num_2 = 7
denom_1 = 2
denom_2 = 5
count = 2
total_num = 0

while (count < 1000) {
  # A(n) = 2 * A(n-1) + A(n-2), for numerator and denominator alike
  temp = num_1
  num_1 = num_2
  num_2 = mul.bigz(as.bigz(2), as.bigz(num_2)) + temp
  temp = denom_1
  denom_1 = denom_2
  denom_2 = mul.bigz(as.bigz(2), as.bigz(denom_2)) + temp
  count = count + 1
  # does the numerator have more digits than the denominator?
  if (nchar(as.character(num_2)) > nchar(as.character(denom_2)))
    total_num = total_num + 1
}
total_num
#[1] 153

system.time()
#user  system elapsed 
#0       0       0 

 

Euler problem 56 – Powerful digit sum

Powerful digit sum

Problem 56

A googol (10^100) is a massive number: one followed by one hundred zeros; 100^100 is almost unimaginably large: one followed by two hundred zeros. Despite their size, the sum of the digits in each number is only 1.

Considering natural numbers of the form a^b, where a, b < 100, what is the maximum digital sum?

This is another straightforward problem which took me just 15-20 minutes to solve. Like the last problem, this one goes through two loops, each running from 1 to 99: raise a to the power b, sum the digits and keep the maximum. The logic is simple; however, there are two things which need to be taken care of. First, you need to set options(digits = 22) and options(scipen = 999), because as the numbers increase they are shown in scientific notation, which later creates problems while calculating the sum of digits; one cannot sum the digits of something like 1.5e10.

The other small adjustment is that you cannot compute all the powers correctly in base R itself; as the numbers increase, you get wrong answers. There has always been a problem with high-precision numbers, as we have experienced in previous problems, so we use the pow.bigz function from the gmp library to calculate the large powers. We had already written a function to calculate the sum of digits, so I used the same approach here.

library(gmp)
options(digits = 22)
options(scipen = 999)  # avoid scientific notation when converting numbers to characters

max_digit_sum = 0
for (a in seq(99)) {
  for (b in seq(99)) {
    cat(a, b, "\n")
    # a^b as a big integer, split into its digits and summed
    digit_sum = sum(as.numeric(unlist(strsplit(as.character(pow.bigz(a, b)), ""))))
    if (digit_sum > max_digit_sum) {
      max_a = a
      max_b = b
      max_digit_sum = digit_sum
    }
  }
}
max_a
#[1] 99

max_b
#[1] 95

max_digit_sum
#[1] 972

system.time()
#user  system elapsed
#0.19   0.00   0.20

 

Euler Problem 55 – Lychrel numbers

The problem statement is quite big. You can view it here.

The question is straightforward, especially if you follow the brute force method. Basically we have two loops: the outer loop takes values from 1 to 10000, while the inner loop runs for at most 50 iterations.

For every number, we add it to its reversed number and check whether the sum is a palindrome. We continue reversing and adding until we find a palindrome. We check this only for 50 iterations, and any number which exceeds the 50-iteration counter is considered a Lychrel number and added to the list. Simple, right?

However, there was one point where I got stuck for a couple of minutes. I was checking whether a number is a palindrome first and only then adding its reversed number and checking again. However, we first need to add and then check; I was doing it the opposite way. It is clearly mentioned in the question:

there are palindromic numbers that are themselves Lychrel numbers; the first example is 4994.

So if I check whether 4994 is a palindrome in the first step itself, it will satisfy the palindrome condition and will not be considered a Lychrel number, which is obviously wrong. We first need to add the number to its reversed number and only then check whether the sum is a palindrome.

library(stringi)

lychrel_nums_under_n <- function(limit) {
  lychrel_num = numeric()
  for (i in seq(limit)) {
    # add the reversed number first, then test for a palindrome (the 4994 case)
    num = i + as.numeric(stri_reverse(i))
    count = 0
    while (num != stri_reverse(num) & count < 50) {
      num = num + as.numeric(stri_reverse(num))
      count = count + 1
    }
    # no palindrome within 50 iterations -> treat it as a Lychrel number
    if (count == 50)
      lychrel_num = c(lychrel_num, i)
  }
  return(lychrel_num)
}

I could have written a function to check whether a number is a palindrome. However, the stringi package already has a function called stri_reverse which reverses a string, so I used that instead.

length(lychrel_nums_under_n(10000))
#[1] 249

system.time(lychrel_nums_under_n(10000))
#user  system elapsed 
#0.75    0.00    0.75 

 

Using bsetools to get live BSE data and send email

So we published a package on PyPI which fetches BSE share prices. However, although the package was released, it was not yet being used for the purpose it was created for in the first place: I was still getting data from NSE, which should now be switched to BSE since the package is up and running.

The code is more or less the same as what we had for nsedata; the only difference now is that we use the bsetools package instead of nsetools. After installing bsetools with

pip install bsetools

we read the csv where all the details of the quotes are stored, use the get_quote function for every quote to get its price from the BSE website, and then create a data frame with the required columns. We use the email_main function, which is the same as we have used previously.

Euler Problem 54 – Poker Hands

Poker hands

Problem 54

For a change, I am not copying the problem statement here because it is way too long; you can have a look at it at Project Euler Problem 54. I solved a Project Euler problem after a long time, for a couple of reasons. First, I was busy with other things: I recently released a Python package, bsetools, which gets share prices from the BSE website, so most of my time went there. Then I finally shifted focus to solving this problem and guess what? It was so difficult to solve. There were so many cases and edge cases which needed to be handled, so many possibilities to be covered. On top of that, I am not a poker player, so the game was quite new to me. I had some idea about it, but I haven't played it enough to understand the small cases. I read the problem completely a couple of times and talked with some colleagues in the office who play poker regularly to get answers to some questions.

So after all those discussions and thinking, I thought of writing separate functions for all the cases mentioned in the question: a function which checks whether a hand is a royal flush, a separate function for a straight flush, one for four of a kind and so on. Initially I thought I would give ranks/points to these categories and compare them against each other to decide which player has the better-ranked hand. For example, a royal flush gets 9 points, a straight flush 8 points, four of a kind 7 points and so on. So if player 1 has a straight he gets 4 points, and if player 2 has a full house he gets 6 points; in this case player 2 wins as he has the higher points. This seemed like a good approach until more complications kicked in. What if there is a tie? If both player 1 and player 2 have a pair, we need to check which pair has the higher value; if it is still a tie, we need to check which player has the higher-valued remaining card. This is just one case of complication; there are many others for different conditions as well.
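
To make the points idea concrete, here is a toy sketch; hand_points and compare_hands are illustrative names only, since in the real code the individual is_* functions and break_ties do this work:

# Points per hand category, as described above (royal flush highest)
hand_points <- c("High Card" = 0, "One Pair" = 1, "Two Pairs" = 2,
                 "Three of a Kind" = 3, "Straight" = 4, "Flush" = 5,
                 "Full House" = 6, "Four of a Kind" = 7,
                 "Straight Flush" = 8, "Royal Flush" = 9)

# Illustrative helper: compare two already-classified hands by their points
compare_hands <- function(category_1, category_2) {
  if (hand_points[category_1] > hand_points[category_2]) return("Player 1")
  if (hand_points[category_1] < hand_points[category_2]) return("Player 2")
  "Tie - break_ties() needed"
}

compare_hands("Straight", "Full House")
#[1] "Player 2"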

The functions I had written only returned TRUE/FALSE and not the values: is_a_pair only returned whether the hand had a pair, not what value that pair had. To handle the cases above, we needed another function, break_ties, which breaks the ties and returns the higher-valued hand. Each function returns "Player 1" or "Player 2" according to the higher-ranked cards. We then count the number of hands "Player 1" has won.

I divided this program into 3 separate files. Poker_card_supporting_functions.R has all the base functions like is_a_pair, is_straight, is_flush etc. Get_hand.R has the function get_hand, which calls all the functions in the supporting-functions file; it receives the hands and checks one by one whether each is a pair, a flush, a straight etc., handling the various cases, and calls break_ties if there is a tie. Poker_Hands.R is the outer script which reads the poker text file, splits each row into separate hands for player 1 and player 2 (the first five columns are player 1's hand, the last 5 are player 2's) and then sends the two hands to get_hand. It stores the returned output ("Player 1" or "Player 2") in the output variable, and we calculate the frequency of each.

Looking back, I think the code can definitely be simplified. There are a lot of unnecessary function calls which could be reduced and made more readable, but by then I had already put too much time into this, was exhausted and just wanted to be done with it. I was happy that I at least reached the answer.

source("/Users/Ronak Shah/Google Drive/Git-Project-Euler/51-60/54.Get_hand.R")

df <- read.table("/Users/Ronak Shah/Downloads/p054_poker.txt", sep = "", header = F, stringsAsFactors = FALSE)
output <- character(nrow(df))

for (i in 1:nrow(df)) {
  hands_1 <- as.character(df[i, 1:5, drop = TRUE])
  hands_2 <- as.character(df[i, 6:10, drop = TRUE])
  output[i] <- get_hand(hands_1, hands_2)
}

table(output)
#output
#Player 1 Player 2 
#376      624 

#user  system elapsed 
#2.32    0.00    2.35 

The complete code can be found here.